
Content Identification Overview

Cyberhaven uses content identification engines to examine the content of data in motion and at rest, which adds to its understanding of the data's context. Correlating content with data lineage is essential to understanding the potential risks to your data.

The efficacy of content identification depends directly on the content attributes you select in your datasets. The content identification engines continuously scan the source content and identify any matches with the selected attributes.

NOTE

Cyberhaven only inspects content for matches against the attributes selected in your datasets. Choose those attributes carefully to ensure effective content identification.

Most content identification processing is performed in the cloud; however, limited identification is also performed on the endpoint.

When content identification is triggered, the following actions take place:

  1. The sensor queries the cloud service to determine whether the data has previously been inspected (based on a cryptographic hash of the content).

  2. If the data has not been inspected before, the sensor uploads the content in question. The default file size limit is 25 MB, which can be adjusted through remote configuration settings. Contact Cyberhaven Support for assistance.

  3. The cloud service extracts the scannable content from the objects and normalizes it for submission to the inspection engines.

  4. The cloud service sends the normalized data through a content identification engine, and all deployed content inspection policies are evaluated. This includes traditional content inspection policies as well as EDM, if enabled.

  5. If any matches are returned, the resulting attribute matches are associated with the event.

  6. For customers using Content Capture, Cyberhaven sends a copy of the data and the content report to one or more customer-controlled cloud buckets, depending on configuration.

  7. Cyberhaven expunges the data from its service.
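The steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not Cyberhaven's implementation: SHA-256 as the hash, the in-memory `inspected` dictionary, the `us_ssn` attribute, and the choice to skip oversized content are all assumptions made for the example, and step 6 (Content Capture) is left out.

```python
import hashlib
import re

# Default upload limit from step 2 (adjustable through remote configuration).
MAX_UPLOAD_BYTES = 25 * 1024 * 1024

# Hypothetical stand-in for the cloud service's record of inspected content:
# SHA-256 hex digest -> sorted list of matched attribute names.
inspected = {}

# Hypothetical dataset attribute, expressed as a regular expression.
ATTRIBUTES = {"us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b")}


def normalize(raw):
    """Step 3: decode the bytes and collapse runs of whitespace."""
    return " ".join(raw.decode("utf-8", errors="replace").split())


def identify(raw):
    """Return matched attribute names, or None if the content is too large."""
    digest = hashlib.sha256(raw).hexdigest()
    if digest in inspected:              # step 1: already inspected?
        return inspected[digest]
    if len(raw) > MAX_UPLOAD_BYTES:      # step 2: enforce the upload limit
        return None                      # assumption: oversized content is skipped
    text = normalize(raw)                # step 3
    matches = sorted(n for n, p in ATTRIBUTES.items() if p.search(text))  # step 4
    inspected[digest] = matches          # step 5: keep only the verdict
    return matches                       # step 7: the content itself is not retained
```

Calling `identify` twice with the same bytes scans the content once; the second call is answered from `inspected` by hash alone.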

Cyberhaven also uses a limited local identification engine to identify metadata-tagged files without performing a cloud lookup. The local engine is triggered by user actions, as described in Coverage for Tags identifications, Content identifications, and Content Capture.

Content identification is enabled by default for new Cyberhaven instances.

Content Identification Across Data States

Depending on whether data is at rest, in motion, or in use, Cyberhaven's content identification engines adjust their scanning processes to match the selected datasets and enforce relevant policies.

Data in Motion: Real-time Content Identification

Cyberhaven analyzes content in real time as it moves across applications, cloud storage, and external devices. This immediate analysis enables the content identification engine to classify sensitive data and enforce policies on subsequent actions.

Content identification is triggered by specific user actions, as detailed in our coverage documentation. The engines first use the file's hash to check whether the file has been scanned previously. If so, the relevant policy is immediately applied. If not, the file is scanned for sensitive content and classified.

Building on the real-time analysis of data in motion, Cyberhaven also inspects data as users interact with it within applications and browsers.

Data at Rest: Proactive Content Identification

    Cyberhaven proactively scans content at rest and classifies it before users interact with the data. This functionality is currently available on Windows endpoints.

    By pre-scanning and classifying the data, Cyberhaven is able to immediately enforce policies and prevent unauthorized data transfers.

    The content identification engine optimizes resource usage by checking file hashes against previously inspected files to avoid duplicate scans.

    Data at rest scans are performed gradually over time to minimize resource usage. The scan results are stored locally on the endpoint, allowing the sensor to use them to enforce policies.

    You can view the events generated from data at rest scans on the Risks Overview page. The scans are logged under the “ch-dar-scanner” user.

    The Risks Overview page includes a new “Event Type” filter to help you distinguish events generated by data at rest scans (“Scan Activity”) from data in motion or data in use events (“Data Activity”).

    To view events only from data at rest scanning, click on Event Type and select Scan Activity. Then select Unmatched policies.

What Content is Inspected?

Cyberhaven inspects a wide range of content, including text files, graphics, and document tags.

For text files, the content is directly compared against values defined in content identifier policies and rules. If the file contains document tags, then the tags are compared against defined policies.

For graphics, Optical Character Recognition (OCR) is used to extract and inspect text within images. See Optical Character Recognition.

For files that are neither documents nor images, such as audio files, metadata is collected to provide contextual information.
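As a rough illustration of this three-way split, the sketch below picks an inspection approach from a file's extension. The extension sets and the `inspection_strategy` function are hypothetical; the actual list of supported file types and how Cyberhaven detects them are not described here.

```python
from pathlib import Path

# Hypothetical extension sets, chosen only for illustration.
TEXT_EXTENSIONS = {".txt", ".csv", ".docx", ".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff"}


def inspection_strategy(filename):
    """Pick how a file would be inspected, based only on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in TEXT_EXTENSIONS:
        return "text"      # content compared directly against identifier rules
    if suffix in IMAGE_EXTENSIONS:
        return "ocr"       # text extracted from the image first, then inspected
    return "metadata"      # e.g. audio: only metadata is collected
```

For example, `inspection_strategy("scan.PNG")` selects the OCR path, while `inspection_strategy("call.mp3")` falls through to metadata collection.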

Custom Identification Rules

Cyberhaven includes many out-of-the-box content identification patterns. If you have additional patterns that you want Cyberhaven to identify, you can create custom content identification rules using regular expressions. See Rule Editor.
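Before entering a regular expression in the Rule Editor, it can help to try it against sample text. The sketch below does this with Python's `re` module. The `PRJ-` project ID format is a made-up example, and the regular expression dialect accepted by the Rule Editor may differ from Python's, so treat this as a sanity check rather than a guarantee.

```python
import re

# Hypothetical pattern: internal project IDs such as "PRJ-004217".
# \b anchors keep the pattern from matching inside longer tokens.
PROJECT_ID = re.compile(r"\bPRJ-\d{6}\b")


def find_project_ids(text):
    """Return every project ID found in the text, in order of appearance."""
    return PROJECT_ID.findall(text)
```

Testing with both matching and near-miss samples (for instance, an ID with too few digits) helps confirm that the pattern is neither too loose nor too strict.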
